Make my own bioinformatics parser with Automa.jl
Design
⇾The chain format for each record could be briefly part into three:
The header line of chain score tName tSize tStrand tStart tEnd qName qSize qStrand qEnd id;
The alignment body of size dt dq;
The alignment last line of size, and a blank line.
⇾The parser thus could be devided into four part:
The header parsing step: add actions for each cell in header line, return the whole header var
The alignment parsing step: add actions for each alignment body, push each alignment record to alignment var
The alignment end parsing step: detect the alignment end, return the alignment var
The record parsing step: detect the end of the record, return the whole record var, maybe convert during this step
Automa: for parser
GenomicFeatures: for converter
TranscodingStreams: for IOStream
Implementation
Define Structure
The final storage structure: ChainMap <: IntervalCollection
The chain format needs to be parsed as a set of IntervalCollection
([ chr:start-end:strand:value, ... ]), with the ability to pair each target interval to its query interval matched. In a similar implementation (CrossMap
) in python, the query regions was passed as the values of the target regions (as Strings) :How CrossMap.py store chain maps. In julia, I think we could do it more effectively, by nesting the IntervalTree
of query as the value of the IntervaTree
of target.
using GenomicFeatures
using Formatting
julia> demo_array = [
Interval(
"tchr1",
j,
j+100,
'+',
Interval("qchr1", j+5, j+105, '+', sprintf1("QID%02d", i))
) for (i, j) in enumerate(1:10)
]
julia> typeof(demo_array)
Vector{Interval{Interval{String}}} (alias for Array{Interval{Interval{String}}, 1})
julia> demo_map = IntervalCollection(demo_array);
julia> typeof(demo_map)
IntervalCollection{Interval{String}}
The intermediate structure: ChainRecord
Like those in Bed.jl, we would define a structure that stores all IOStreams
related to one chain record, and the positions of each item in the IOStream
, and define a function to convert it to the ChainMap
format.
mutable struct Record
# ChainRecord, data and filled range
data ::Vector{UInt8}
filled ::UnitRange{Int}
# number of align subintervals
naln ::Int
# indexes for header line
score ::UnitRange{Int}
tname ::UnitRange{Int}
tsize ::UnitRange{Int}
tstrand::UnitRange{Int}
tstart ::UnitRange{Int}
tend ::UnitRange{Int}
qname ::UnitRange{Int}
qsize ::UnitRange{Int}
qstrand::UnitRange{Int}
qstart ::UnitRange{Int}
qend ::UnitRange{Int}
id ::UnitRange{Int}
# indexes for aligns
alndata::Vector{UnitRange{Int}}
alnend ::UnitRange{Int}
end
:x CrossMapUtilL331L334
Find the source here
if target_strand == '+':
maps[source_name].add_interval( Interval(sfrom, sfrom+size,(target_name,
tfrom, tfrom+size,target_strand)))
elif target_strand == '-':
maps[source_name].add_interval( Interval(sfrom, sfrom+size,(target_name,
target_size - (tfrom+size), target_size - tfrom, target_strand)))